Results 1 - 20 of 25
1.
bioRxiv ; 2024 Jan 29.
Article in English | MEDLINE | ID: mdl-38352303

ABSTRACT

Polygenic scores (PGSs), increasingly used in clinical settings, frequently include many genetic variants, with performance typically peaking at thousands of variants. Such highly parameterized PGSs often include variants that do not pass a genome-wide significance threshold. We propose a mathematical perspective that renders the effects of many of these non-significant variants random rather than causal, with the randomness capturing population structure. We devise methods to assess variant effect randomness and population stratification bias. Applying these methods to 141 traits from the UK Biobank, we find that, for many PGSs, the effects of non-significant variants are considerably random, with the extent of randomness associated with the degree of overfitting to population structure of the discovery cohort. Our findings explain why highly parameterized PGSs simultaneously have superior cohort-specific performance and limited generalizability, suggesting the critical need for variant randomness tests in PGS evaluation. Supporting code and a dashboard are available at https://github.com/songlab-cal/StratPGS.
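
For readers unfamiliar with how a PGS is formed, a minimal sketch follows. It assumes the standard additive form, in which the score is a weighted sum of allele dosages; the function name, dosages, and effect sizes are hypothetical and are not taken from the StratPGS repository.

```python
def polygenic_score(dosages, effects):
    """Additive polygenic score: sum over variants of effect size times allele dosage."""
    if len(dosages) != len(effects):
        raise ValueError("dosages and effects must have the same length")
    return sum(d * b for d, b in zip(dosages, effects))


# Hypothetical individual: allele dosages (0, 1, or 2 copies) at four variants.
dosages = [0, 1, 2, 1]
# Hypothetical per-allele effect sizes, e.g. taken from GWAS summary statistics.
effects = [0.5, -0.25, 0.25, 1.0]

score = polygenic_score(dosages, effects)
print(score)
```

A highly parameterized PGS in the sense of the abstract is simply one in which these two lists contain thousands of entries, many of them for variants below genome-wide significance.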

2.
bioRxiv ; 2024 Jan 04.
Article in English | MEDLINE | ID: mdl-38260588

ABSTRACT

The immune system comprises multiple cell lineages and heterogeneous subsets found in blood and tissues throughout the body. While human immune responses differ between sites and over age, the underlying sources of variation remain unclear as most studies are limited to peripheral blood. Here, we took a systems approach to comprehensively profile RNA and surface protein expression of over 1.25 million immune cells isolated from blood, lymphoid organs, and mucosal tissues of 24 organ donors aged 20-75 years. We applied a multimodal classifier to annotate the major immune cell lineages (T cells, B cells, innate lymphoid cells, and myeloid cells) and their corresponding subsets across the body, leveraging probabilistic modeling to define bases for immune variations across donors, tissue, and age. We identified dominant tissue-specific effects on immune cell composition and function across lineages for lymphoid sites, intestines, and blood-rich tissues. Age-associated effects were intrinsic to both lineage and site as manifested by macrophages in mucosal sites, B cells in lymphoid organs, and T and NK cells in blood-rich sites. Our results reveal tissue-specific signatures of immune homeostasis throughout the body and across different ages. This information provides a basis for defining the transcriptional underpinnings of immune variation and potential associations with disease-associated immune pathologies across the human lifespan.

3.
Res Sq ; 2023 Nov 21.
Article in English | MEDLINE | ID: mdl-38045283

ABSTRACT

We present SLIViT, a deep-learning framework that accurately measures disease-related risk factors in volumetric biomedical imaging, such as magnetic resonance imaging (MRI) scans, optical coherence tomography (OCT) scans, and ultrasound videos. To evaluate SLIViT, we applied it to five datasets spanning these three data modalities and seven learning tasks (including both classification and regression) and found that it consistently and significantly outperforms domain-specific state-of-the-art models, typically improving performance (ROC AUC or correlation) by 0.1-0.4. Notably, compared to existing approaches, SLIViT can be applied even when only a small number of annotated training samples is available, which is often a constraint in medical applications. When trained on fewer than 700 annotated volumes, SLIViT obtained accuracy comparable to that of trained clinical specialists while reducing annotation time by a factor of 5,000, demonstrating its utility to automate and expedite ongoing research and other practical clinical scenarios.

4.
medRxiv ; 2023 Sep 06.
Article in English | MEDLINE | ID: mdl-37732190

ABSTRACT

Purpose: The risk of developing age-related macular degeneration (AMD) is influenced by genetic background. In 2016, the International AMD Genomics Consortium (IAMDGC) identified 52 risk variants in 34 loci, and a polygenic risk score (PRS) based on these variants was associated with AMD. The Israeli population has a unique genetic composition: Ashkenazi Jewish (AJ), Jewish non-Ashkenazi, and Arab sub-populations. We aimed to perform a genome-wide association study (GWAS) for AMD in Israel, and to evaluate PRSs for AMD. Methods: For our discovery set, we recruited 403 AMD patients and 256 controls at Hadassah Medical Center. We genotyped all individuals using a custom exome chip. We imputed non-typed variants using cosmopolitan and AJ reference panels. We recruited an additional 155 cases and 69 controls for validation. To evaluate the predictive power of PRSs for AMD, we used IAMDGC summary statistics excluding our study and developed PRSs via either clumping/thresholding or LDpred2. Results: In our discovery set, 31 of the 34 loci previously reported by the IAMDGC were associated with AMD at P < 0.05. Of those, all effects were directionally consistent with the IAMDGC and 11 loci had a p-value under the Bonferroni-corrected threshold (0.05/34 = 0.0015). At a threshold of 5×10⁻⁵, we discovered four suggestive associations in FAM189A1, IGDCC4, C7orf50, and CNTNAP4. However, only the FAM189A1 variant was associated with AMD in the replication cohort after Bonferroni correction. A prediction model including the LDpred2-based PRS and other covariates had an AUC of 0.82 (95% CI: 0.79-0.85) and performed better than a covariates-only model (P = 5.1×10⁻⁹). Conclusions: Previously reported AMD-associated loci were nominally associated with AMD in Israel. A PRS developed based on a large international study is predictive in Israeli populations.
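
The Bonferroni-corrected threshold quoted in the abstract can be reproduced with the short sketch below. The significance level (0.05) and the number of loci (34) come from the abstract; the function name is an assumption made for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


# 34 previously reported loci tested at a family-wise level of 0.05.
threshold = bonferroni_threshold(0.05, 34)
print(f"{threshold:.4f}")
```

A locus counts as nominally associated at P < 0.05 but passes the corrected threshold only if its p-value falls below this smaller value.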

5.
bioRxiv ; 2023 Jan 06.
Article in English | MEDLINE | ID: mdl-36711575

ABSTRACT

Defining and accounting for subphenotypic structure has the potential to increase statistical power and provide a deeper understanding of the heterogeneity in the molecular basis of complex disease. Existing phenotype subtyping methods primarily rely on clinically observed heterogeneity or metadata clustering. However, they tend to capture the dominant sources of variation in the data, which often do not describe the mechanistic heterogeneity of the phenotype of interest; in fact, such dominant sources of variation, such as population structure or technical variation, are generally expected to be independent of subphenotypic structure. We instead aim to find a subspace with signal that is unique to a group of samples for which we believe that subphenotypic variation exists (e.g., cases of a disease). To that end, we introduce Phenotype Aware Components Analysis (PACA), a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. In the context of disease, PACA learns a gradient of variation unique to cases in a given dataset, while leveraging control samples to account for variation and imbalances of biological and technical confounders between cases and controls. We evaluated PACA using an extensive simulation study, as well as on various subtyping tasks using genotypes, transcriptomics, and DNA methylation data. Our results provide multiple lines of strong evidence that PACA allows us to robustly capture weak unknown variation of interest while being calibrated and well-powered, far surpassing the performance of alternative methods. This makes PACA a state-of-the-art tool for defining de novo subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.

6.
Front Bioinform ; 1: 792605, 2021.
Article in English | MEDLINE | ID: mdl-36303752

ABSTRACT

Calling differential methylation at a cell-type level from tissue-level bulk data is a fundamental challenge in genomics that has recently received more attention. These studies most often aim at identifying statistical associations rather than causal effects. However, existing methods typically make an implicit assumption about the direction of effects, and thus far, little to no attention has been given to the fact that this directionality assumption may not hold and can consequently affect statistical power and control for false positives. We demonstrate that misspecification of the model directionality can lead to a drastic decrease in performance and increase in risk of spurious findings in cell-type-specific differential methylation analysis, and we discuss the need to carefully consider model directionality before choosing a statistical method for analysis.

7.
PLoS One ; 15(9): e0239474, 2020.
Article in English | MEDLINE | ID: mdl-32960917

ABSTRACT

Worldwide, testing capacity for SARS-CoV-2 is limited and bottlenecks in the scale-up of polymerase chain reaction (PCR)-based testing exist. Our aim was to develop and evaluate a machine learning algorithm to diagnose COVID-19 in the inpatient setting. The algorithm was based on basic demographic and laboratory features to serve as a screening tool at hospitals where testing is scarce or unavailable. We used retrospectively collected data from the UCLA Health System in Los Angeles, California. We included all emergency room or inpatient cases receiving SARS-CoV-2 PCR testing who also had a set of ancillary laboratory features (n = 1,455) between 1 March 2020 and 24 May 2020. We tested seven machine learning models and used a combination of those models for the final diagnostic classification. In the test set (n = 392), our combined model had an area under the receiver operating characteristic curve of 0.91 (95% confidence interval 0.87-0.96). The model achieved a sensitivity of 0.93 (95% CI 0.85-0.98) and a specificity of 0.64 (95% CI 0.58-0.69). We found that our machine learning algorithm had excellent diagnostic metrics compared to SARS-CoV-2 PCR. This ensemble machine learning algorithm to diagnose COVID-19 has the potential to be used as a screening tool in hospital settings where PCR testing is scarce or unavailable.
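
As a reminder of how the sensitivity and specificity reported above are defined, here is a minimal sketch. The confusion-matrix counts are hypothetical and chosen only for illustration; the abstract does not report the study's counts, and the function names are assumptions.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of PCR-positive cases that the classifier flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of PCR-negative cases that the classifier flags as negative."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, NOT the study's data.
tp, fn = 45, 5    # 50 PCR-positive patients
tn, fp = 80, 20   # 100 PCR-negative patients

print(sensitivity(tp, fn), specificity(tn, fp))
```

A screening tool of the kind described typically favors high sensitivity, accepting a lower specificity so that few true cases are missed.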


Subject(s)
Betacoronavirus; Clinical Laboratory Techniques/methods; Coronavirus Infections/diagnosis; Inpatients; Machine Learning; Pneumonia, Viral/diagnosis; Adult; Aged; Area Under Curve; COVID-19; COVID-19 Testing; Clinical Laboratory Techniques/standards; Humans; Los Angeles; Mass Screening/methods; Mass Screening/standards; Middle Aged; Pandemics; Polymerase Chain Reaction; Retrospective Studies; SARS-CoV-2
8.
PLoS Genet ; 16(9): e1009018, 2020 09.
Article in English | MEDLINE | ID: mdl-32925908

ABSTRACT

Reverse causality has made it difficult to establish the causal directions between obesity and prediabetes and between obesity and insulin resistance. To disentangle whether obesity causally drives prediabetes and insulin resistance already in non-diabetic individuals, we utilized the UK Biobank and METSIM cohorts to perform Mendelian randomization (MR) analyses in non-diabetic individuals. Our results suggest that both prediabetes and systemic insulin resistance are caused by obesity (p = 1.2×10⁻³ and p = 3.1×10⁻²⁴). As obesity reflects the amount of body fat, we next studied how adipose tissue affects insulin resistance. We performed both bulk RNA sequencing and single-nucleus RNA sequencing on frozen human subcutaneous adipose biopsies to assess adipose cell-type heterogeneity and mitochondrial (MT) gene expression in insulin resistance. We discovered that adipose MT gene expression and body fat percent are both independently associated with insulin resistance (p ≤ 0.05 for each) when adjusting for the decomposed adipose cell-type proportions. Next, we showed that these 3 factors, adipose MT gene expression, body fat percent, and adipose cell types, explain a substantial amount (44.39%) of variance in insulin resistance and can be used to predict it (p ≤ 2.64×10⁻⁵ in 3 independent human cohorts). In summary, we demonstrated that obesity is a strong determinant of both prediabetes and insulin resistance, and discovered that individuals' adipose cell-type composition, adipose MT gene expression, and body fat percent predict their insulin resistance, emphasizing the critical role of adipose tissue in systemic insulin resistance.


Subject(s)
Adipose Tissue/metabolism; Insulin Resistance/physiology; Obesity/genetics; Adipocytes/metabolism; Adiposity; Adult; Body Mass Index; Cohort Studies; Diabetes Mellitus, Type 2/metabolism; Female; Humans; Insulin Resistance/genetics; Male; Middle Aged; Obesity/physiopathology; Prediabetic State/metabolism; Prediabetic State/physiopathology; Subcutaneous Fat/metabolism
9.
Sci Rep ; 10(1): 11019, 2020 07 03.
Article in English | MEDLINE | ID: mdl-32620816

ABSTRACT

Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. We observe that snRNA-seq is commonly subject to contamination by high amounts of ambient RNA, which, if overlooked, can lead to biased downstream analyses, such as identification of spurious cell types. We present a novel approach to quantify contamination and filter droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: (1) human differentiating preadipocytes in vitro, (2) fresh mouse brain tissue, and (3) human frozen adipose tissue (AT) from six individuals. All three data sets showed evidence of extranuclear RNA contamination, and we observed that existing methods failed to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq, our clustering strategy also successfully filtered single-cell RNA-seq data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.


Subject(s)
Adipose Tissue/metabolism; Brain/metabolism; Sequence Analysis, RNA/methods; Animals; Gene Expression Profiling; Humans; Likelihood Functions; Mice; Single-Cell Analysis; Supervised Machine Learning
10.
Nat Commun ; 11(1): 2891, 2020 06 03.
Article in English | MEDLINE | ID: mdl-32493922

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

11.
Nat Commun ; 11(1): 1971, 2020 04 24.
Article in English | MEDLINE | ID: mdl-32332754

ABSTRACT

We present Bisque, a tool for estimating cell type proportions in bulk expression. Bisque implements a regression-based approach that utilizes single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data. These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression. Importantly, compared to existing methods, our approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous. When applied to subcutaneous adipose and dorsolateral prefrontal cortex expression datasets with both bulk RNA-seq and snRNA-seq data, Bisque replicates previously reported associations between cell type proportions and measured phenotypes across abundant and rare cell types. We further propose an additional mode of operation that merely requires a set of known marker genes.


Subject(s)
Computational Biology/methods; RNA-Seq/methods; Single-Cell Analysis/methods; Adipose Tissue/metabolism; Algorithms; Gene Expression Profiling/methods; Gene Expression Regulation; Genomics; Humans; Prefrontal Cortex/metabolism; RNA, Small Cytoplasmic; Software; Transcriptome
12.
Invest Ophthalmol Vis Sci ; 61(2): 48, 2020 02 07.
Article in English | MEDLINE | ID: mdl-32106291

ABSTRACT

Purpose: Anti-vascular endothelial growth factor (VEGF) therapy for neovascular AMD (nvAMD) yields variable outcomes. We performed a genome-wide association study for anti-VEGF treatment response in nvAMD to identify variants potentially underlying this variability. Methods: Israeli patients with nvAMD who underwent anti-VEGF treatment (n = 187) were genotyped on a whole exome chip containing approximately 500,000 variants. Genotyping was correlated with delta visual acuity (deltaVA) between baseline and after three injections of anti-VEGF. Top principal components, age, and baseline VA were included in the analysis. Two lead associated variants were genotyped in an independent validation set of patients with nvAMD (n = 108). Results: Linear regression analysis on 5,353,842 variants revealed five exonic variants with an association P value of less than 6 × 10⁻⁵. The top variant in the gene VWA3A (P = 1.77 × 10⁻⁶) was tested in the validation cohort. The minor allele of the VWA3A variant was associated with worse response to treatment (P = 0.02). The average deltaVA of discovery plus validation was -0.214 logMAR (≈ a gain of 10.7 Early Treatment Diabetic Retinopathy Study letters) for homozygotes for the major allele, 0.172 logMAR (≈ a loss of 8.6 Early Treatment Diabetic Retinopathy Study letters) for heterozygotes, and 0.21 logMAR (≈ a loss of 10.5 Early Treatment Diabetic Retinopathy Study letters) for homozygotes for the minor allele. Minor allele carriers had a higher frequency of macular hemorrhage at baseline. Conclusions: A VWA3A gene variant was associated with worse response to anti-VEGF treatment in Israeli patients with nvAMD. The VWA3A protein is a precursor of the multimeric von Willebrand factor, which is involved in blood coagulation, a system previously associated with nvAMD.
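
The letter counts quoted alongside the logMAR changes follow from the usual ETDRS chart convention, in which each line of 5 letters corresponds to 0.1 logMAR (0.02 logMAR per letter). The sketch below reproduces the three conversions; the function and constant names are assumptions made for illustration.

```python
# ETDRS convention: 5 letters per chart line, 0.1 logMAR per line.
LETTERS_PER_LOGMAR = 50


def delta_logmar_to_letters(delta_logmar):
    """Convert a change in logMAR to ETDRS letters.

    A decrease in logMAR is an improvement, so the sign is flipped:
    positive return values are letters gained, negative are letters lost.
    """
    return -delta_logmar * LETTERS_PER_LOGMAR


# Average deltaVA per genotype group, as reported in the abstract.
groups = [
    ("major-allele homozygotes", -0.214),
    ("heterozygotes", 0.172),
    ("minor-allele homozygotes", 0.21),
]
for name, delta in groups:
    print(name, round(delta_logmar_to_letters(delta), 1))
```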


Subject(s)
Angiogenesis Inhibitors/therapeutic use; Choroidal Neovascularization; Protein Precursors/genetics; Wet Macular Degeneration; Aged; Aged, 80 and over; Choroidal Neovascularization/drug therapy; Choroidal Neovascularization/genetics; Female; Humans; Israel; Male; Middle Aged; Regression Analysis; Visual Acuity; Wet Macular Degeneration/drug therapy; Wet Macular Degeneration/genetics; von Willebrand Factor/genetics
13.
Nat Commun ; 10(1): 3417, 2019 07 31.
Article in English | MEDLINE | ID: mdl-31366909

ABSTRACT

High costs and technical limitations of cell sorting and single-cell techniques currently restrict the collection of large-scale, cell-type-specific DNA methylation data. This, in turn, impedes our ability to tackle key biological questions that pertain to variation within a population, such as identification of disease-associated genes at a cell-type-specific resolution. Here, we show mathematically and empirically that cell-type-specific methylation levels of an individual can be learned from its tissue-level bulk data, conceptually emulating the case where the individual has been profiled with a single-cell resolution and then signals were aggregated in each cell population separately. Provided with this unprecedented way to perform powerful large-scale epigenetic studies with cell-type-specific resolution, we revisit previous studies with tissue-level bulk methylation and reveal novel associations with leukocyte composition in blood and with rheumatoid arthritis. For the latter, we further show consistency with validation data collected from sorted leukocyte sub-types.


Subject(s)
Cell Separation/methods; Computational Biology/methods; DNA Methylation/genetics; Epigenesis, Genetic/genetics; Single-Cell Analysis/methods; Arthritis, Rheumatoid/blood; CpG Islands/genetics; Humans; Leukocyte Count; Leukocytes/classification; Leukocytes/cytology
14.
Genome Biol ; 20(1): 138, 2019 07 12.
Article in English | MEDLINE | ID: mdl-31300005

ABSTRACT

Methylation datasets are affected by innumerable sources of variability, both biological (cell-type composition, genetics) and technical (batch effects). Here, we propose a reference-free method based on sparse canonical correlation analysis to separate the biological from technical sources of variability. We show through simulations and real data that our method, CONFINED, is not only more accurate than the state-of-the-art reference-free methods for capturing known, replicable biological variability, but it is also considerably more robust to dataset-specific technical variability than previous approaches. CONFINED is available as an R package as detailed at https://github.com/cozygene/CONFINED.


Subject(s)
Artifacts; DNA Methylation; Genetic Variation; Software; Datasets as Topic
15.
Nat Commun ; 9(1): 4919, 2018 11 21.
Article in English | MEDLINE | ID: mdl-30464216

ABSTRACT

Testing for association between a set of genetic markers and a phenotype is a fundamental task in genetic studies. Standard approaches for heritability and set testing strongly rely on parametric models that make specific assumptions regarding phenotypic variability. Here, we show that resulting p-values may be inflated by up to 15 orders of magnitude, in a heritability study of methylation measurements, and in a heritability and expression quantitative trait loci analysis of gene expression profiles. We propose FEATHER, a method for fast permutation-based testing of marker sets and of heritability, which properly controls for false-positive results. FEATHER eliminated 47% of methylation sites found to be heritable by the parametric test, suggesting a substantial inflation of false-positive findings by alternative methods. Our approach can rapidly identify heritable phenotypes out of millions of phenotypes acquired via high-throughput technologies, does not suffer from model misspecification and is highly efficient.
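
To illustrate the general idea of permutation-based testing, a minimal sketch follows. It is a generic permutation test, not FEATHER's algorithm: the test statistic (absolute Pearson correlation between a single marker score and the phenotype), the function names, and the toy data are all assumptions made for illustration.

```python
import math
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return abs(cov) / math.sqrt(var_x * var_y)


def permutation_p_value(marker, phenotype, n_perm=999, seed=0):
    """Empirical p-value: shuffle the phenotype to simulate the null of no association."""
    rng = random.Random(seed)
    observed = abs_correlation(marker, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs_correlation(marker, shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy data in which the phenotype tracks the marker score perfectly.
marker = list(range(20))
phenotype = list(range(20))
print(permutation_p_value(marker, phenotype))
```

Because the null distribution is built from the data themselves, no parametric assumption about phenotypic variability is needed; the cost is the repeated recomputation of the statistic, which is the expense a fast method has to avoid.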


Subject(s)
Genetic Techniques; Quantitative Trait, Heritable; Statistics as Topic; DNA Methylation; Gene Expression; Phenotype
16.
Genome Biol ; 19(1): 141, 2018 09 21.
Article in English | MEDLINE | ID: mdl-30241486

ABSTRACT

We introduce a Bayesian semi-supervised method for estimating cell counts from DNA methylation by leveraging easily obtainable prior knowledge on the cell-type composition distribution of the studied tissue. We show mathematically and empirically that alternative methods which attempt to infer cell counts without a methylation reference only capture linear combinations of cell counts rather than provide one component per cell type. Our approach allows the construction of components such that each component corresponds to a single cell type, and provides a new opportunity to investigate cell compositions in genomic studies of tissues for which it was not possible before.


Subject(s)
Cell Count/methods; DNA Methylation; Bayes Theorem
17.
J Comput Biol ; 25(7): 794-808, 2018 07.
Article in English | MEDLINE | ID: mdl-29932739

ABSTRACT

Estimation of heritability is an important task in genetics. The use of linear mixed models (LMMs) to determine narrow-sense single-nucleotide polymorphism (SNP)-heritability and related quantities has received much recent attention, because of its ability to account for variants with small effect sizes. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. The common way to report the uncertainty in REML estimation uses standard errors (SEs), which rely on asymptotic properties. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals (CIs). In addition, for larger data sets (e.g., tens of thousands of individuals), the construction of SEs itself may require considerable time, as it requires expensive matrix inversions and multiplications. Here, we present FIESTA (Fast confidence IntErvals using STochastic Approximation), a method for constructing accurate CIs. FIESTA is based on parametric bootstrap sampling, and, therefore, avoids unjustified assumptions on the distribution of the heritability estimator. FIESTA uses stochastic approximation techniques, which accelerate the construction of CIs by several orders of magnitude, compared with previous approaches as well as with the analytical approximation used by SEs. FIESTA builds accurate CIs rapidly, for example, requiring only several seconds for data sets of tens of thousands of individuals, making FIESTA a very fast solution to the problem of building accurate CIs for heritability for all data set sizes.
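
The parametric bootstrap on which the abstract says FIESTA is based can be illustrated on a toy problem. The sketch below is not FIESTA and does not involve an LMM or heritability: it assumes a simple normal model and builds a percentile CI for its mean, and every name and number in it is hypothetical.

```python
import random
import statistics


def parametric_bootstrap_ci(data, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean of a normal model via parametric bootstrap.

    1. Fit the model to the data (here: sample mean and standard deviation).
    2. Simulate new data sets of the same size from the fitted model.
    3. Re-estimate the parameter on each simulated data set.
    4. Report empirical quantiles of the re-estimated values.
    """
    rng = random.Random(seed)
    n = len(data)
    mu_hat = statistics.fmean(data)
    sd_hat = statistics.stdev(data)
    boot_estimates = sorted(
        statistics.fmean([rng.gauss(mu_hat, sd_hat) for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = boot_estimates[int((alpha / 2) * n_boot)]
    upper = boot_estimates[int((1 - alpha / 2) * n_boot) - 1]
    return mu_hat, lower, upper


# Hypothetical measurements.
data = [4.1, 5.3, 4.8, 5.9, 5.1, 4.4, 5.6, 4.9, 5.2, 4.7]
print(parametric_bootstrap_ci(data))
```

The interval comes from the simulated distribution of the estimator rather than from an asymptotic standard error, which is why the approach remains usable when the parameter space is bounded or the sample is small.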


Subject(s)
Genome-Wide Association Study/statistics & numerical data; Models, Statistical; Quantitative Trait Loci/genetics; Computer Simulation; Genotype; Humans; Phenotype; Polymorphism, Single Nucleotide/genetics; Software
18.
Genetics ; 207(4): 1275-1283, 2017 12.
Article in English | MEDLINE | ID: mdl-29025915

ABSTRACT

Testing for the existence of variance components in linear mixed models is a fundamental task in many applied fields. In statistical genetics, the score test has recently become instrumental in the task of testing an association between a set of genetic markers and a phenotype. With few markers, this amounts to set-based variance component tests, which attempt to increase power in association studies by aggregating weak individual effects. When the entire genome is considered, it allows testing for the heritability of a phenotype, defined as the proportion of phenotypic variance explained by genetics. In the popular score-based Sequence Kernel Association Test (SKAT) method, the assumed distribution of the score test statistic is uncalibrated in small samples, with a correction being computationally expensive. This may cause severe inflation or deflation of P-values, even when the null hypothesis is true. Here, we characterize the conditions under which this discrepancy holds, and show it may occur also in large real datasets, such as a dataset from the Wellcome Trust Case Control Consortium 2 (n = 13,950) study, and, in particular, when the individuals in the sample are unrelated. In these cases, the SKAT approximation tends to be highly overconservative and therefore underpowered. To address this limitation, we suggest an efficient method to calculate exact P-values for the score test in the case of a single variance component and a continuous response vector, which can speed up the analysis by orders of magnitude. Our results enable fast and accurate application of the score test in heritability and in set-based association tests. Our method is available at http://github.com/cozygene/RL-SKAT.


Subject(s)
Genetic Association Studies/statistics & numerical data; Genetic Markers; Genetic Variation; Genome/genetics; Algorithms; Computer Simulation; Humans; Models, Genetic; Phenotype; Polymorphism, Single Nucleotide/genetics; Software
19.
Bioinformatics ; 33(14): i325-i332, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881982

ABSTRACT

MOTIVATION: Epigenome-wide association studies can provide novel insights into the regulation of genes involved in traits and diseases. The rapid emergence of bisulfite-sequencing technologies enables performing such genome-wide studies at the resolution of single nucleotides. However, analysis of data produced by bisulfite-sequencing poses statistical challenges owing to low and uneven sequencing depth, as well as the presence of confounding factors. The recently introduced Mixed model Association for Count data via data AUgmentation (MACAU) can address these challenges via a generalized linear mixed model when confounding can be encoded via a single variance component. However, MACAU cannot be used in the presence of multiple variance components. Additionally, MACAU uses a computationally expensive Markov Chain Monte Carlo (MCMC) procedure, which cannot directly approximate the model likelihood. RESULTS: We present a new method, Mixed model Association via a Laplace ApproXimation (MALAX), that is more computationally efficient than MACAU and allows modeling of multiple variance components. MALAX uses a Laplace approximation rather than MCMC-based approximations, which enables direct approximation of the model likelihood. Through an extensive analysis of simulated and real data, we demonstrate that MALAX successfully addresses statistical challenges introduced by bisulfite-sequencing while controlling for complex sources of confounding, and can be over 50% faster than the state of the art. AVAILABILITY AND IMPLEMENTATION: The full source code of MALAX is available at https://github.com/omerwe/MALAX. CONTACT: omerw@cs.technion.ac.il or ehalperin@cs.ucla.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
DNA Methylation; Epigenomics/methods; Sequence Analysis, DNA/methods; Software; Humans; Markov Chains; Monte Carlo Method; Sulfites